A Normalizer for UGC in Brazilian Portuguese
Authors
Abstract
User-generated content (UGC) is an important source of information for governments, companies, political candidates and consumers. However, most Natural Language Processing (NLP) tools and techniques are developed from and for standard-language texts, whereas UGC is especially full of creativity and idiosyncrasies, which act as noise for NLP purposes. This paper presents UGCNormal, a lexicon-based tool for UGC normalization. It comprises a tokenizer, a sentence segmentation tool, a phonetic-based speller and some lexicons, which originated from a deep analysis of a corpus of product reviews in Brazilian Portuguese. The normalizer was evaluated on two different data sets and made between 31% and 89% of the appropriate corrections, depending on the type of text noise. UGCNormal was also validated on two downstream tasks: POS tagging, where accuracy improved from 91.35% to 93.15%, and opinion classification, where the average of the positive and negative F1-scores improved from 0.736 to 0.758.
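To make the idea of lexicon-based normalization concrete, the following is a minimal Python sketch, not UGCNormal's actual implementation. The lexicon entries, the regular-expression tokenizer, and the function names `tokenize`, `normalize_token` and `normalize` are all assumptions made for illustration; the phonetic-based speller and the sentence segmentation tool mentioned in the abstract are not modeled here.

```python
import re

# Hypothetical toy lexicon mapping noisy forms to standard forms.
# These entries are illustrative only and are NOT taken from UGCNormal.
TOY_LEXICON = {
    "vc": "você",
    "pq": "porque",
    "tb": "também",
    "q": "que",
}

# Assumed tokenizer: runs of word characters, or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def normalize_token(token, lexicon):
    """Replace a token with its lexicon entry, if one exists."""
    replacement = lexicon.get(token.lower())
    if replacement is None:
        return token  # unknown tokens pass through unchanged
    # Keep an initial capital if the noisy token had one.
    if token[0].isupper():
        return replacement.capitalize()
    return replacement


def normalize(text, lexicon=TOY_LEXICON):
    """Tokenize the text and normalize each token against the lexicon."""
    return [normalize_token(token, lexicon) for token in tokenize(text)]


if __name__ == "__main__":
    print(normalize("Vc gostou pq é bom!"))
```

A lookup of this kind only corrects forms that are already listed, which is one plausible reason a lexicon-based tool would be combined with a speller for unlisted misspellings.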
Similar resources
‘Minor’ Languages, ‘Broken’ Translations: On Brazilian Reworkings of an Albanian Novel
This essay approaches the challenges of global translation in the 21st century from what might still be considered a somewhat uncommon example: a direct translation of Ismail Kadaré's 1978 novel Prill e thyër (Broken April) from the original Albanian into Brazilian Portuguese in 2001. Not only does it examine and compare lexical elements in the source and target texts and the usage of translato...
Exploring Word Embeddings for Unsupervised Textual User-Generated Content Normalization
Text normalization techniques based on rules, lexicons or supervised training requiring large corpora are neither scalable nor domain-interchangeable, and this makes them unsuitable for normalizing user-generated content (UGC). Current tools available for Brazilian Portuguese make use of such techniques. In this work we propose a technique based on distributed representation of words (or word embed...
DIXI - Portuguese text-to-speech system
This paper describes the software architecture of the Portuguese text-to-speech system DIXI. The system has three major modules. The first one contains the text normalizer and searches each word in the lexicon. The second one is a multi-level rule-based module for lexical stress assignment, orthographic to phonetic transcription, metrically based prosodic patterning and for generating the evoluti...
Cyclic Orbit Codes with the Normalizer of a Singer Subgroup
An algebraic construction for constant dimension subspace codes is called an orbit code. It arises as the orbits under the action of a subgroup of the general linear group on subspaces in an ambient space. In particular, orbit codes of a Singer subgroup of the general linear group have been investigated recently. In this paper, we consider the normalizer of a Singer subgroup of the general linear group a...
Translation, cultural adaptation and validation for Brazilian Portuguese of the Cardiff Acne Disability Index instrument*
BACKGROUND The Cardiff Acne Disability Index was originally developed in English for measuring quality of life of acne patients. Considering the psychosocial impact of this disease, it is important to have instruments culturally and linguistically validated for use in Brazilian adolescents. OBJECTIVE To translate the Cardiff Acne Disability Index into Brazilian Portuguese, culturally adapt it...